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Abstract 

Since the penahzed hkehhood function of the smoothly chpped absolute deviation (SCAD) 
penalty is highly non-linear and has many local optima, finding a local solution to achieve the 
so-called oracle property is an open problem. We propose an iterative algorithm, called the 
OEM algorithm, to fill this gap. The development of the algorithm draws direct impetus from 
a missing-data problem arising in design of experiments with an orthogonal complete matrix. 
In each iteration, the algorithm imputes the missing data based on the current estimates 
of the parameters and updates a closed-form solution associated with the complete data. 
By introducing a procedure called active orthogonization, we make the algorithm broadly 
applicable to problems with arbitrary regression matrices. In addition to the SCAD penalty, 
the proposed algorithm works for other penalties like the MCP, lasso and nonnegative gar- 
rote. Convergence and convergence rate of the algorithm are examined. The algorithm has 
several unique theoretical properties. For the SCAD and MCP penalties, an OEM sequence 
can achieve the oracle property after sufficient iterations. For various penalties, an OEM 
sequence converges to a point having grouping coherence for fully aliased regression matri- 
ces. For computing the ordinary least squares estimator with a singular regression matrix, 
an OEM sequence converges to the Moore-Penrose generalized inverse-based least squares 
estimator. 

KEY WORDS: Design of experiments; MCP; Missing data; Optimization; Oracle property; 
Orthogonal design; SCAD; The EM algorithm; The Lasso. 
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1 INTRODUCTION 

Fan and Li (2001) proposed the smoothly chpped absolute deviation (SCAD) penalty to 
achieve simultaneous estimation and variable selection. Consider a linear model 

Y = X|3 + £, (1) 

where X = (xij) is the n x p regression matrix, Y G is the response vector, |3 = 
(/3i, . . . , /3p)' is the vector of regression coefficients and the distribution of the vector of 
random error e = (ei, . . . ,£„)' is N{On, cr^In) with 0„ being the nth zero vector and /„ 
being the n x n identity matrix. Throughout, let || ■ || denote the Euclidean norm. A 
regularized least squares estimator of (3 with this penalty is given by solving 



mm 
P 



Xpf + 2Vp,(|/3,|) 



(2) 



where for ^ > 0, 

P'^{e) = XI{e ^ A) + {aX - 9)+I{e > X)/{a - 1), (3) 

a > 2, A > is the tuning parameter and / is the indicator function. In order to apply the 
penalty Pa equally on all the variables, X can be standardized so that 

n 

= 1, for j = (4) 

i=l 

Both theory and computation of the estimator in ([2]) have been actively studied. On the the- 
oretical side. Fan and Li (2001) introduced an important concept, called the oracle property. 
An estimator of |3 having this property can not only select the correct submodel asymptoti- 
cally, but also estimate the nonzero coefficients as efficiently as if the correct submodel were 
known in advance. On the computational side, existing algorithms for solving this optimiza- 
tion problem include local quadratic approximation (Fan and Li 2001; Hunter and Li 2005), 
local linear approximation (Zou and Li 2008), the coordinate descent algorithm (Tseng 2001; 
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Tseng and Yun 2009; Breheny and Huang 2010; Mazumder, Friedman, and Hastie 2010) and 
the minimization by iterative soft thresholding (MIST) algorithm (Schifano, Strawderman, 
and Wells 2010), among others. 

Departing from the existing work, we study the SCAD penalty from a new perspective, 
targeting on the interface between theory and computing. Fan and Li (2001) proved that 
there exists a local solution to (j2]) with the oracle property. From the optimization viewpoint, 
([2]) can have many local minima (Huo and Chen 2010) and it is very challenging to find one 
of them to achieve the oracle property. To the best of our knowledge, no theoretical results 
are available to show that any existing algorithm can provide such a local minimum. We 
propose an iterative algorithm, called orthogonalizing EM (OEM), to fill this gap. We will 
show in Section |4] that the OEM solution to ([2]) can indeed achieve the oracle property 
under regularity conditions. OEM draws its direct impetus from a missing data problem 
with a complete orthogonal design arising in design of experiments. Throughout, a matrix 
is orthogonal if its columns are orthogonal. In each iteration, the algorithm imputes the 
missing data based on the current estimate of |3 and updates a closed-form solution to 
(j2]) associated with the complete data. Much beyond this orthogonal design formulation, 
the OEM algorithm applies to general data structures by actively orthogonalizing arbitrary 
regression matrices. 

Though the inspiration of the OEM algorithm stems from the SCAD penalty, it, not 
surprisingly, works for the general penalized regression problem: 



where |3 G O, 6 is a subset of and A is the vector of tuning parameters. Besides the 
SCAD penalty, choices for P(|3; A) include the ridge regression (Hoerl and Kennard 1970), 
the nonnegative garrote (Breiman 1995), the lasso (Tibshirani 1996) and the MCP (Zhang 
2010). Algorithms for solving the problem in ([5]) include those developed in Fu (1998), 
Grandvalet (1998), Osborne, Presnell, and Turlach (2000), the LARS algorithm introduced 
in Efron, Hastie, Johnstone, and Tibshirani (2004) and the coordinate descent algorithm 




(5) 
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(Tseng 2001; Friedman, Hastie, Hofling and Tibshirani 2007; Wu and Lange 2008; Tseng 
and Yun 2009), and are available in R packages like lars (Hastie and Efron 2011), glmnet 
(Friedman, Hastie, and Tibshirani 2011) and scout (Witten and Tibshirani 2011). 

In addition to achieving the oracle property for the SCAD and MCP penalties, the OEM 
algorithm has several other unique theoretical features. 1. Having grouping coherence: An 
estimator p of p in ([T]) is said to have grouping coherence if it has the same coefficient 
for full aliased columns in X (Zou and Hastie 2005). For the lasso, SCAD and MCP, 
an OEM sequence converges to a point having grouping coherence. 2. Convergence in 
singular case: When in ([1]) is singular, the ordinary least squares estimator given by ([5]) 
without any penalty is not unique. For this singular case, an OEM solution, or essentially the 
Healy-Westmacott estimator (Healy and Westmacott 1956), converges to the Moore-Penrose 
generalized inverse-based least squares estimator. 

The remainder of the article will unfold as follows. Section [2] derives the OEM algo- 
rithm for a missing data problem with a complete orthogonal design. Section [3] significantly 
broadens the applicability of the algorithm by introducing an idea for actively expanding 
any regression matrix to an orthogonal matrix. Section |4] establishes the oracle property 
of the OEM solution for the SCAD and MCP. Section O provides convergence properties of 
the OEM algorithm. Section |6] shows that for a regression matrix with full aliased columns, 
an OEM sequence for the lasso, SCAD or MCP converges to a solution with grouping co- 
herency and illustrates how to use the OEM algorithm to compute the ordinary least squares 
estimator for a singular regression matrix. Section [7] provides some discussion. 

2 THE OEM ALGORITHM AS A PENALIZED HEALY- 
WESTMACOTT PROCEDURE 

Orthogonal designs are widely used in science and engineering. Such designs have been 
intensively studied in different branches of statistics including design of experiments, infor- 
mation theory (Mac Williams and Sloane 1977), liner models, sampling survey and computer 
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experiments. Popular classes of orthogonal designs include orthogonal arrays (Hedayat, 
Sloane, and Stufken 1999), orthogonal main-effect plans (Addelman 1962; Wu and Hamada 
2009) and orthogonal Latin hypercube designs (Ye 1998; Steinberg and Lin 2006; Bingham, 
Sitter, and Tang 2009; Lin, Mukerjee, and Tang 2009; Pang, Liu, and Lin, 2009; Sun, Liu, 
and Lin 2009; Lin, Bingham, Sitter, and Tang 2010) from the computer experiments litera- 
ture (Santner, Williams, and Notz 2003; Fang, Li, and Sudjianto 2005). In this section, we 
motivate the OEM algorithm by using a missing data problem with an orthogonal complete 
design. Suppose that the matrix X in ([1]) for this problem is a submatrix of an m x p 
complete orthogonal matrix 

X, = {X' A')', (6) 
where A is the [m — n) x p missing matrix. Let 

Y, = (Y', YV J' (7) 

define the vector of complete observations with Ymiss corresponding to A. If Ymiss were 
observable, then the ordinary least square estimator of P based on the complete data {Xc, 
Yc) has a closed form as X^. is orthogonal. In light of this fact, Healy and Westmacott (1956) 
proposed an iterative procedure to compute the ordinary least squares estimator ^qls P- 
In each iteration, their procedure imputes the values of Ymiss and updates the closed-form 
ordinary least squares estimator associated with the complete data. The OEM algorithm 
follows the same idea but solves ([2]) with the SCAD penalty. If Ymiss were observable, then 
X in ([2]) and Y can be replaced by X^ and Yc, yielding a closed-form solution to ([2]). Much 
beyond this orthogonal design formulation, we will significantly broaden the applicability of 
the algorithm in Section [3]by introducing an idea, called active orthogonalization, to actively 
expand any regression matrix into an orthogonal matrix. 
Define 

A = A'A. (8) 

Let {di, . . . , dp) denote the diagonal elements of X'^X^- The OEM algorithm for solving the 
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optimization problem in (|2]) proceeds as follows. Let (3*^°-' be an initial estimate of |3. For 
A; = 0, 1, . . ., impute Y^-,^^ as Y/ = A^^''\ let = (Y', Y^', and solve 



p(/c+i) 



argmm 
P 



X,|3f + 25^P,(|/3,|) 



until converges. Letting 



(9) 



u = [ui, Up)' = X'Y + A|3 



(fc) 



(10) 



(Ml becomes 



13 



(fc+i) 



argmm 



5^(rf,/3j-2«,/3,) + 2 5^P,(|/3, 
.i=i i=i 



which is separable in the dimensions of |3. If X in ([T]) is standardized as in 
for all j, ( iTTl) has a closed- form 



with dj ^ 1 



/3 



(fc+i) 



sign(uj) (lujl — X)^/dj, when ^ {dj + 1)A, 

sign('Uj) [(a — — aA]/[(a — l)dj — l], when {dj + 1)A < ^ aXdj, 



Uj/dj, 



when > aXdj. 



(12) 

As pointed out in Dempster, Laird, and Rubin (1977) that the Healy-Westmacott pro- 
cedure is essentially an EM algorithm, OEM is an EM algorithm as well. The complete 
data Yc = (Y', Y^^^gg)' in ([7]) follow a regression model Yc = XcP + tc, where tc is from 
A^(Om, Im)- Let (3gcAD be a solution to ([2]), where |3gcAD = argmaxpL(|3 | Y) and the 
penalized likelihood function -L((3 | Y) is 



(27r)~"/2 exp 



-||Y-X|3f 



exp 



E^a(|/3,i) 



6 



Given the E-step of the OEM algorithm for the SCAD is 



E[log{L(|3|Y,)} I Y,|3('=)] 
= -C{||Y - Xpf + E(||Y^,,, - X|3f I pW) + 2j2Pxm\)} 

p 

= ||Y-X|3f + ||Ap('') - A|3f + 2^Pa(|/3,|)} 

i=i 

for some constant C > 0. Define 

gscAD(|3 I P^'^)) = ||Y - X|3f + II Ap^'^) - AP||2 + 2j2Pxm)- (13) 

i=i 

The M-step of the OEM algorithm for the SCAD is 

^argminQscAoO | P^'^ 

which is equivalent to (fTTj) . 

For the general penalized regression problem in ([5]), the M-step of the OEM algorithm 
becomes 

|3(fc+i) = argminQ(p | p^'^)), (14) 

where Q replaces Qscad in ( |T3|) for the corresponding penalty function. If 9 and P in ([5]) are 
decomposable as 6 = 11^=1 P(P; A) = Yl^=i Pjit^jj '^)) similarly to (ITT]) , (fl^ reduces 

to p one- dimensional problems 

/jj'^+i) = argmin [d,P] - 2u,P, + P,(/3,; A)], for j = 1, . . . ,p, (15) 

with u = (Mi,...,Mp)' defined in the same way as in (fTOl) . This shortcut applies to the 
following penalties: 
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1. The lasso (Tibshirani 1996), where Qj = M, 

P,(/3,;A) = 2A|/3,|, (16) 

and (fT5|) becomes 



/3f+^) = signK.) . (17) 



kill — A 



dj 



Here, for a G M, (a)+ denotes max{a, 0}. 



2. The nonnegative garrote (Breiman 1995), where 6^ = {x : x/3j ^ 0}, Pj{f3j; X) = 

2\f3j/ (3j, (3j is the ordinary least squares estimator of and (fTSjl becomes 

3. The elastic- net (Zou and Hastie 2005), where 9j = R, 

P,(/3,;A)=2Ai|/3,| + A2/3|. (18) 

and (|T5l) becomes 

5. The MCP (Zhang 2010), where 6^ = M, Pj{l3f A) = 2PA(|/3il), and 

P[{e) = (A - ^/a)J(e ^ a\) (20) 

with a > 1 and 9 > Q. If X in ([T]) is standardized as in (jl]) with ^ 1 for all j, f[T5|) 
becomes 

^(fc+i) ^ I sign(uj)a(|Mj| - X)J{adj - 1), when ^ aXdj, ^^^^ 
Uj/dj, when \uj\ > aXdj 
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6. The "Berhu" penalty (Owen 2006), where 6^ = M, Pj{f3j;\) = 2\{\f3j\I{\f3j\ < 6) + 
{/3] + 6^)I{\l3j\ ^ 6)/{26)} for some 5 > 0, and ([IS]) becomes 



Obviously, if the penalty on (3 disappears, the OEM algorithm reduces to the Healy- 
Westmacott procedure. Quite interestingly. Theorem [6] in Section [5] shows that, for the same 
X and Y in ([T]), the OEM algorithm for the elastic-net and lasso numerically converges 
faster than the Healy-Westmacott procedure. 

Example 1. For the model in ([T]), let the complete matrix X^. be a fractional factorial 
design from Xu (2009) with 4096 runs in 30 factors. Clearly, X^. is an orthogonal design. 
Let X in ([1]) be the submatrix of Xc consisting of the first 3000 rows and let Y be generated 
with (7=1 and 



Here, p = 30 and n = 3000. Assume the response values corresponding to the last 1096 rows 
of Xc are missing. We used the OEM algorithm to solve the optimization problem in ([2]) with 
an initial value P*-"-* =0 and a criterion to stop when relative changes in all coefficients are 
less than 10"^ For A = 1 and a = 3.7 in ([31), Fi gure 1 plots values of the objective function 
in ([2]) of the OEM sequence against iteration numbers, where the algorithm converges at 
iteration 13. 



3 THE GENERAL FORMULATION WITH ACTIVE 
ORTHOGONALIZATION 



The OEM algorithm in Section [2] was derived for a missing-data problem where X in ([T]) 
is imbedded in a pre-specified orthogonal matrix. We drop this assumption in this section 




when \uj\ < A + djS, 
when \uj\ ^ A + dj6. 



(3j = (-l)^ exp [ - 2(j - l)/20] for j = 1, . . . ,p. 



(22) 
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Figure 1. Values of the objective function of an OEM sequence for the SCAD against 
iterations for Example [H 

and further develop the algorithm for general data structures by introducing a procedure to 
actively expand an arbitrary matrix to an orthogonal matrix. The general idea of augmenting 
extra data has been used for EM problems before. For example, for a covariance estimation 
problem in Rubin and Szatrowski (1982), extra data are added elaborately to make the max- 
imum likelihood estimator of the expanded patterned covariance matrices have an explicit 
form. To facilitate the use of the OEM algorithm in Section |2l the contribution here is to 
develop a scheme to orthogonalize the matrix X with an arbitrary structure. 

Take S to he a p x p diagonal matrix with non-zero diagonal elements si, . . . ,Sp. Define 

Z = XS-\ (23) 

The eigenvalue decomposition of Z'Z (Wilkinson 1965) is 

V'TV, 

where V is an orthogonal matrix and F is a diagonal matrix whose diagonal elements, 
7i ^ ■ ■ ■ ^ 7p, are the nonnegative eigenvalues of Z'Z. Let 

^ = #{j: 7, =71, J = l,...,p} (24) 




-© e e e © © e © ©- 
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denote the number of the 7^ equal to 71. For example, if 71 = 72 and 71 > 7j for j = 3, . . . , p, 
then t = 2. Define 

B = diag(7i - 7^+1, . . . , 71 - 7p) (25) 

and 

A = B^/^ViS, (26) 

where Vi is the submatrix of V consisting of the last p — t rows. Let be the augmented 
matrix of A and X. 

Lemma 1. The matrix constructed above is orthogonal. 
Proof. Note that 

X'^Xc = X'X + A'A, 
which, by plugging (!25|) and (!26|) . becomes 

S[Z'Z + F'(7i/p - T)V]S = 71 52 

Now, because 

/o o\ 

7i/p - r = 

\o b) 

( 1271) is orthogonal, which completes the proof. □ 

The underlying geometry of the active orthogolization in Lemma [1] can be described as 
follows. For a vector x G M™, let P^x denote its projection onto a subspace u of M™. This 
lemma implies that for the column vectors oi X in ([T]), Xi, . . . ,Xp G M", there exists a set 
of mutually orthogonal vectors x^, . . . , Xcp G ]R"+^^*, essentially the column vectors of X^. 
in dnj, satisfy the condition that Pr for j = 1, . . . ,p. Proposition 1 makes this 

precise. 

Proposition 1. Let u be an ra-dimensional subspace of M™ with n ^ m. lip^m — n + 1, 
then for any p vectors Xi, . . . , Xp G u, there exist p vectors Xd, . . . , Xcp G M™" such that 
P^Xci = Xi for j = 1, . . . ,p and x^x^j = for i 7^ j. 

11 



(27) 




Figure 2. Expand two two-dimensional vectors Xi and X2 to two three-dimensional vectors 
Xci and Xc2 with x'^^Xc2 = 0. 

Figure 2 expands two vectors Xi and X2 in to two orthogonal vectors x^i and Xc2 in M.^. 

Now, if Xc from Lemma [1] is treated as the complete matrix defined in ([6]), the OEM 
algorithm in Section [2] follows through immediately. 

When using the OEM algorithm to solve ([5]), in (fTO!) instead of computing A in (126|1 . 
one may compute A = A' A and the diagonal entries di, ... ,dp of X'^Xc- Note that 

A = 7i5=^ - X'X (28) 

and 

rfj = 7isJ for j = 1, . . . ,p, (29) 

where S and Z are defined in f l2^ and 71 is the largest eigenvalue of Z'Z = S'^X'XS'^. 
One way to compute 71 is to use the power method (Wilkinson 1965) described below. 
Given a nonzero initial vector a'^'^^ G MP, let 'yf^ = ||a*^'^)||. For k = 0,1,..., compute 
^(fc+i) _ X' XsS^'^ /'^[''^ and 7^'^^"'^^ = ||a*^*^+"'^-'|| until convergence. If a'-^-' is not an eigenvector 
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of any 7^- that does not equal 71, then 7); converges to 71. For t defined in ([23]), the 
convergence rate is hnear (Watkins 2002) specified by 



hm 



7i -71 II 7m 



7i -7il 



71 



An easy way to make A = A' A in (128|) positive definite is to replace B in (125|) by 



B = diag((i - 74+1, . . . , - 7p) 



with ^ 7i, which changes (128|) and (|29|) to 



(30) 



and 



cij = rfs^, for j = 1, . . . ,p, 



(31) 



respectively. If > 71, then A = A' A is positive definite. 



Remark 1. The matrix S in 0231) can be chosen fiexibly. One possibility is to use S = Ip 

so that 



X'X + A'A = dl^ 



(32) 



with d ^ 7i, and Xc/Vd is standardized as in (jl]). 
Example 2. Suppose that X in ([1]) is orthogonal. Take 



S = diag 



/2 



i=l 



(33) 



Since t = p, A in fl26l) is empty. This result indicates the active orthogonalization procedure 
will not overshoot: if X is orthogonal already, it adds no row. 
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Example 3. Let 



X 



/ 








3/2 ^ 




-4/3 


-2/3 


1/6 




2/3 


4/3 


1/6 


\ 


-2/3 


2/3 


-7/6 ^ 



lfS = /3, (EE]) gives A = (-2/v^, 2/v^, 1/v^). 

Example 4. Consider a two-level design in three factors, A, B and C: 



1/ 



The regression matrix including all main effects and two-way interactions is 



X 





-1 


-1 


-1 


1 


1 


1 


\ 




-1 


1 


1 


-1 


-1 


1 






1 


-1 


1 


-1 


1 


-1 




\ 


1 


1 


-1 


1 


-1 


-1 


/ 



where the last three columns for the interactions are fully aliased with the first three columns 
for the main effects. For S = I3, fl26l) gives 



v 



-2 -2 
-2 -2 
-2 -2 



The elements in A are chosen flexibly, not restricted to ±1. 

Example 5. Consider a 1000 x 10 random matrix X = (xij) with entries independently 



14 



drawn from the uniform distribution on [0, 1). Using S in ( 133|) . (!26|) gives 
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Only nine rows need to be added to make this large X matrix orthogonal. 

4 ACHIEVING THE ORACLE PROPERTY WITH 
NONCONVEX PENALTIES 

Fan and Li (2001) introduced an important concept called the oracle property and showed 
that there exists one local minimum of ([2]) with this property. However, because the opti- 
mization problem in (|2]) has an exponential number of local optima (Huo and Ni 2007; Huo 
and Chen 2010), no theoretical results in the literature claim that an existing algorithm can 
provide such a local minimum. In this section, we prove that an OEM sequence for the 
SCAD and MCP can indeed achieve this property. The theoretical results in this and the 
following sections work for the OEM algorithm in both Sections |2] and [31 

First, we describe the oracle concept. A penalized least squares estimator of |3 in ([T]) has 
this property if it can not only select the correct submodel asymptotically, but also estimate 
the nonzero coefficients Pi in ( IMj) as efficiently as if the correct submodel were known in 
advance. Suppose that the number of nonzero coefficients of |3 in ([1]) is pi (with pi ^ p) and 
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partition |3 as 

|3 = O'l, |3'2)', (34) 

where = ^"^^ component of 13^ is zero. Divide the regression matrix X in ([1]) to 
{Xi X2) with Xi corresponding to P^. If all the variables that influence the response in 
([T]) are known, an oracle estimator can be given as P = (P^^, ^2)' with P2 = 0) where 



Pi-|3i~iV(0, a\X[X{^ 



We need several assumptions. 
Assumption 1. As n — )■ 00, 




X'X 

y S 

n 



where SI is positive definite and Si is pi x pi. Furthermore, X j ^fn is standardized such 
that each entry on the diagonal of X' X jn is 1, and X' Xjn + A' A = dip with d ^ 71, 
where d = 0(1) and 71 is the largest eigenvalue of X'X/n. 

Assumption 2. The tuning parameter A = A^, in ([3]) satisfies the condition that, as n — )► 00, 
A„/ n ^ and A^/ y/n — )■ 00. 

Let {|3(^\ A; = 0, 1, . . . , } be the OEM sequence from ^ for the SCAD with a fixed 
a > 2 in ([3]). We need an assumption on k = kn- Let 77 be the largest eigenvalue of 
Ip-^ —X'iXi/{nd). Under Assumption [H r] tends to a limit lying between and 1 as n — )■ 00. 

Assumption 3. As n — )• 00, rf'^Xnl — )• and A;^ exp(— c(A„/i/n)^) — > for any c > 0. 

One choice for fc„ to satisfy Assumption |3] is 

kn = ( — F= ) for some v > 0. 



Under Assumption [2l /c^i — 00 as n — )• 00. 
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Under the above assumptions, Theorem [T] shows that |3*-'^-' = O^'^'' , ^^2 '^ )' can achieve the 
oracle property. 

Theorem 1. Suppose that Assumption [TH hold. If (3^°^ = {X'X)-^X'Y, then as n ^ 00, 

(i) P(|3(') = 0) ^ 1; 

(ii) ^/n{^P - |3i) N{0,a^'E^^) in distribution. 

The proof of Theorem [1] is deferred to the Appendix. 

Remark 2. From ( I6OII in the proof of Theorem [H for k = 1,2,..., (S^'^^ is consistent in 
variable selection. That is, P(/3]*''' 7^ for j = l,...,pi) — 1 and P(|32^'' = 0) — 1 as 
n — )■ 00. 

Remark 3. The proof of Theorem [1] uses the convergence rates of P{Ak), P(-Ba,) and P(C^). 
If an OEM sequence satisfies the condition that /3j'''^^^ = when \uj\ < X and (3^'^^^^ = uj/d 
when \uj\ > c\ for some c = 0(1), then P^A^^i) = P{\uj\ < A) and P(i?fc+i) = Pd^jl > cA). 
Since an OEM sequence for the MCP satisfies the above condition, an argument very similar 
to the proof in the Appendix shows that the convergence rates of P(y4fc), P(-Ba:) and P(C^) 
for the MCP are the same as those with the SCAD. Thus, under Assumption [T]|3l Theorem [1] 
holds for the MCP with a fixed a > 1 in (I20]). 

Huo and Chen (2010) showed that, for the SCAD penalty, solving the global minimum 
of ([5]) leads to an NP-hard problem. Theorem [1] indicates that as far as the oracle property 
is concerned, the local solution given by OEM will suffice. 

5 CONVERGENCE OF THE OEM ALGORITHM 

In this section, we derive theoretical results on convergence properties of the OEM algo- 
rithm and compare the convergence rates of OEM for the ordinary least squares estimator 
and the elastic-net and lasso. The general penalty in is considered here. Our derivations 
employs the main tool in Wu (1983) in conjunction with special properties of the penalties 
mentioned in Section [2j 
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We make several assumptions for G and P(|3; A) in (|5]). 

Assumption 4. The parameter space G is a closed convex subset of MP. 

Assumption 5. For a fixed A, the penalty P((3; A) — > +00 as ||(3|| — > +00. 

Assumption 6. For a fixed A, the penalty P((3; A) is continuous with respect to |3 G G. 

All penalties discussed in Section [2] satisfy these assumptions. The assumptions cover the 
case in which the iterative sequence {p^'^)} defined in ^ may fall on the boundary of G 
(Nettleton 1999), like the nonnegative garrote (Breiman 1995) and the nonnegative lasso 
(Efron et al. 2004). The bridge penalty (Frank and Friedman 1993) in (137|) also satisfies the 
above assumptions. 

For the model in ([T]), denote the objective function in ([5]) by 

/(|3) = ||Y-X(3f + P((3;A). (35) 

For some penalizes like the bridge, it may be numerically infeasible to perform the M-step 
in f|T^ . For this situation, following the generalized EM algorithm in Dempster, Laird, and 
Rubin (1977), we define a generalized OEM algorithm to be an iterative scheme 

^ (36) 

where |3— )'A^(|3)cGisa point-to-set map such that 

g((j) I p) ^ Q(|3 I |3), for all c|) G 7W(P). 

Here, Q is given by replacing the SCAD with P(|3; A) in f|T3|) . The OEM sequence defined 
by ( |T^ is a special case of ( l36l) . For example, the generalized OEM algorithm can be used 
for the bridge penalty, where Gj = R and 

P,(/3,;A) = A|/3,r (37) 
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for some a G (0, 1) in ([5]). Since the solution to (fT5|) with the bridge penalty has no closed 
form, one may use one-dimensional search to compute Z?]'^^^^ that satisfies (136|) . By Assump- 
tion 1, {3 G e : ^ is compact for any /(|3^°^) > -oo. By Assumption El M is 
a closed point-to-set map (Zangwill 1969; Wu 1983). 

The objective functions of the EM algorithms in the literature like those discussed in 
Wu (1983), Green (1990) and McLachlan and Krishnan (2008) are typically continuously 
differentiable. This condition does not hold for the objective function in (I5l) with the lasso 
and other penalties. A more general definition of stationary points is needed here. We call 
P G a stationary point of / if 

/((I -t)(3 + tc|}) -/((3) 
lim inf ^ — ^ for all cjj G 9. 

t^o+ t 

Let S denote the set of stationary points of /. Analogous to Theorem 1 in Wu (1983) on the 
global convergence of the EM algorithm, we have the following result. 

Theorem 2. Let {P^'^''} be a generalized OEM sequence generated by ( l36|) . Suppose that 

/(|3('+'^) < /(P^')) for all 13^'^ eQ\S. (38) 

Then all limit points of {p*-^-*} are elements of S and /(p'''^^) converges monotonically to 
/* = /(p*) for some p* G S. 

Theorem 3. If p* is a local minimum of (^(P | P*), then p* G S. 

This theorem follows from the fact that /(p) — (5(p | P*) is differentiable and 

a[/(P)-Q(PI p*)] 

ap p=p* 

Remark 4. By Theorem[3|, if P*^'^-' ^ S, then p*^'^^ cannot be a local minimum of (5(P | P'''^^). 
Thus, there exists at least one point P^'^^^^ e M{^^^^) such that Q(p(''+^^ | P^'^^) < g(p('^) | 
P*-^-*) and therefore satisfies the condition in (1381) . As a special case, an OEM sequence 
generated by ( IT4|) satisfies (138|1 in Theorem [2j 

19 



Next, we consider the convergence of a generalized OEM sequence {P^'^)} in dMl). By 
Theorem [3l such results will automatically hold for an OEM sequence as well. If the penalty 
function P(|3;A) is convex and /((3) has a unique minimum, Theorem H] shows that {p*^'^-'} 
converges to the global minimum. 

Theorem 4. Let {P*^'^-'} be defined in Theorem O Suppose that /(P) in ( 135|) is a convex 
function on B with a unique minimum P* and that ([38]) holds for {p Then p^^^ ^ p* 
as A; — oo. 

Proof. We only need to show that 5* = {p*}. For c(} G with c|} 7^ p* and t > 0, we have 

/((l-t)c|. + tp*)-/(p-) ^ t/(p-) + (l-t)/(c|.)-/(c|.) ^ ^^p,^ _ ^^^^ ^ ^ 
ij t 

This implies cjj ^ 5. □ 

Theorem [5] discusses the convergence of an OEM sequence {P^"^)} for more general penal- 
ties. For a G M, define S{a) = {fp G S : l{fp) = a}. From Theorem [21 all limit points of an 
OEM sequence are in S{1*), where /* is the limit of /(p^^-*) in Theorem[21 Theorem [S] states 
that the limit point is unique under certain conditions. 

Theorem 5. Let {P^'^-'j be a generalized OEM sequence generated by f[5B]) with A' A > 0. 
If ([38D holds, then all limit points of {p^''^} are in a connected and compact subset of S{1*). 
In particular, if the set S{1*) is discrete in that its only connected components are singletons, 
then P^*^-* converges to some P* in S{1*) as A; —t- 00. 

Proof. Note that Q{^^^+^^ \ p^'^)) = /(p(''+^)) + ||Ap('=+^) - Ap^'^^f ^ Q{^^^^ \ p^'^)) = 
/(p('=)). By Theorem [21 ||Ap('=+^) - Ap^^'^f ^ /(p^^)) - /(p(^'+^)) ^ as k ^ 00. Thus, 
|||3(fc+i) _ |3(fc)|| ^ 0. This theorem now follows immediately from Theorem 5 of Wu (1983). 

□ 

Since the bridge, SCAD and MCP penalties all satisfy the condition that S{1*) is discrete, 
an OEM sequence for any of them converges to the stationary points of /. Theorem [5] is 
obtained under the condition A' A > 0. Since the error e in ([Tj) has a continuous distribution, 
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it is easy to show that Theorem [5] holds with probabihty one if A'A is singular when d defined 
in ( 130|) and (13T|) equals 71. 

We now derive the convergence rate of the OEM sequence in (|T^ . Following Dempster, 
Laird, and Rubin (1977), write 

where the map M(|3) = (Mi(|3), . . . , Mp(|3))' is defined by (fT^ . We capture the convergence 
rate of the OEM sequence {|3^*'''} through M. Assume that (132!) holds for d ^ ji, where 
7i is the largest eigenvalue of X'X. For the active orthogolization in ( I30p and f l3ip . taking 
S = Ip satisfies this assumption; see Remark [T] 

Let (3* be the limit of the OEM sequence {13^'')}. As in Meng (1994), we call 

|||3(^+^)-ril ||M(|3,)-M(r)|| 
it = limsup TTT = limsup-^^ ttt 39 

the global rate of convergence for the OEM sequence. If there is no penalty in i.e., 
computing the ordinary least squares estimator, the global rate of convergence R in fl39l) 
becomes the largest eigenvalue of J((3*), denoted by Rq, where J{<p) is the p x p Jacobian 
matrix for M(c|)) having (i, j)th entry dMi(^) /d(pj. If ( l32l) holds, then J(|3*) = A/d with 
A = A'A. Thus, 

Bo = (40) 

d 

For ([5]), the penalty function P(|3; A) typically is not sufficiently smooth and R in fl39|) 
does not have an analytic form. Theorem [6] gives an upper bound of Rnet, the value of R for 
the elastic-net penalty in ( IT8l) with Ai, A2 ^ 0. 



Theorem 6. For A from (|6]), if (132]) holds, then R^et < ^o- 

Proof. Let Xj denote the jth column of X and aj denote the jth column of A = A'A, 
respectively. For an OEM sequence for the elastic-net, by (fT9l) . 

M,(|3) = /(x' Y + a' |3), for j = l,...,p, 
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where 

\u\ — Ai 



f{u) = sign(M) 
For j = 1, . . . ,p, observe that 



d + X2 



M,(|3('=))-M,-(|3*)| _ |/(x;.Y + a;p('=))-/(x;.Y + a^r 



(x;.Y + a;.|3('=))-(x;.Y + a^P* 



- |3*|| |(x;Y + a^P^^'O - (x;.Y + a;. (3* 



^ 1 la^dsc^) - r) 

^ — ■ 



Thus, 



|||3W_p*|| d |||3(fe) _ |3*|| d 

This completes the proof. □ 

Remark 5. Theorem [6] indicates that, for the same X and Y in ([T]), the OEM solution for 
the elastic-net converges faster than its counterpart for the ordinary least squares. Since the 
lasso is a special case of the elastic-net with A2 = in (fTSjl . this theorem holds for the lasso 
as well. 

Remark 6. From (140|) and Theorem [6], the convergence rate of the OEM algorithm depends 
on the ratio of the smallest eigenvalue, 7^, and the largest eigenvalue, 71, of X'X. This rate 
is the fastest when 71 = 7^, i.e., if X is orthogonal and standardized. This result suggests 
that OEM converges faster if X is orthogonal or nearly orthogonal like from a supersaturated 
design or a nearly orthogonal Latin hypercube design (Owen 1994; Tang 1998). This result is 
in agreement with the recent finding in the design of experiments community that the use of 
orthogonal or nearly orthogonal designs can significantly improve the accuracy of penalized 
variable selection methods (Phoa, Pan, and Xu 2009; Deng, Lin, and Qian 2010; Zhu 2011). 



Example 6. We generate X from p dimensional Gaussian distribution A^(0, V) with n 
independent observations, where the (i, j)th entry of V is 1 for i = j and p for i 7^ j. Values 
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Figure 3. (Left) the average values of Ro in fj^O|) against increasing n for Example [6l (right) 
the average iteration numbers against increasing n for Example O where the dashed and 
solid lines denote the ordinary least squares estimator and the lasso, respectively. 

of Y and |3 are generated by ([1]) and (!22l) . The same setup was used in Friedman, Hastie, 
and Tibshirani (2009). For p = 10, p = 0.1, A = 0.5 and increasing n, the left panel of 
Figure 3 depicts the average values of Rq in ( l40l) against increasing n and the right panel of 
the figure depicts the average iteration numbers against increasing n, with the dashed and 
solid lines corresponding to the ordinary least squares estimator and the lasso, respectively. 
Quite strikingly, this figure indicates that OEM requires fewer iterations as n becomes larger, 
which makes OEM particulary attractive for situations with massive data (SAMSI 2012). It 
is important to note that here the OEM sequence for the lasso requires fewer iteration than 
its counterpart for the ordinary least squares, empirically validating Theorem [61 

6 POSSESSING GROUPING COHERENCE 

In this section, we consider the convergence of the OEM algorithm when the regression 
matrix X in ([T]) is singular due to fully aliased columns or other conditions. Let X be 
standardized as in (jl]) with columns xi, . . . , Xp. If Xj and Xj are fully aliased, i.e., |xj| = |x2|, 
then the objective function in (j5]) for the lasso is not strictly convex and has many minima 
(Zou and Hastie 2005). Data with fully aliasing structures commonly appear in observational 
studies and various classes of experimental designs like supersaturated designs (Wu 1993; 
Lin 1993; Tang and Wu 1997; Li and Lin 2002; Xu and Wu 2005) and factorial designs (Dey 
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and Mukerjee 1999; Mukerjee and Wu 2006). 

Zou and Hastie (2005) states that if some columns of X are identical, it is desirable to 
have grouping coherence by assigning the same coefficient to them. Definition [1] makes this 
precise. 

Definition 1. An estimator P = 0i, . . . , /3p)' of |3 in ([1]) has grouping coherence if Xj = Xj 
implies (3i = (3j and Xj = —Xj implies (3i = 

Let Op denote the zero vector in R^. Let e^- be the vector obtained by replacing the ith 
and jth entries of Op with 1. Let e^j be the vector obtained by replacing the ith and jth 
entries of Op with 1 and —1, respectively. Let S denote the set of all efj and e,^-. By Definition 
dl an estimator (3 has grouping coherence if and only if for any (X E S with X(X = 0, a'P = 0. 

Lemma 2. Suppose that (1321) holds. For the OEM sequence {P^''^} of the lasso, SCAD or 
MCP, if Xa = and a'^^^'^ = for a e then a'p(^+^) = 0. 

Proof. For u defined in i^, we have that a'u = a'X'Y + a'{dl.p - X'X)^^''^ = for any 
aeS with Xa = and a'^^''^ = 0. Then by (HZD, (^2) and (JUD, an OEM sequence of the 
lasso, SCAD or MCP satisfies the condition that if a'u = 0, then a'p^''^^^ =0 ior a E S. 
This completes the proof. □ 

Remark 7. Lemma |2] implies that, for k = 1,2, . . ., (3*^'^^ has grouping coherence if P^^-* has 
grouping coherence. Thus, if {P^'^''} converges, then its limit has grouping coherence. By 
Theorem [5l if > Ai in (132|) . then an OEM sequence for the SCAD or MCP converges to a 
point with grouping coherence. 

When X in ([T]) has fully aliased columns, the objective function in (jS]) for the lasso 
has many minima and hence the condition in Theorem H] does not hold. Theorem [7] shows 
that, even with full aliasing, an OEM sequence ( IT7|) for the lasso converges to a point with 
grouping coherence. 

Theorem 7. Suppose that ([32]) holds. If p^°) has grouping coherence, then as A; — )■ oo, the 
OEM sequence {p^'^)} of the lasso converges to a limit that has grouping coherence. 
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Proof. Partition the matrix X in ([T]) as (Xi X2), where no two columns of X2 are fully 
aliased and any column of Xi is fully aliased with at least one column of X2. Let J 
denote the number of columns in Xi. Partition (3 as (|3']^, ^2)' and P*-''-* as {^f^^ , ^^2^ )', 
corresponding to Xi and X2, respectively. For j = 1, . . . ,p, let 

^(i) = = 1, • • • ,P : = 

By Lemma [21 for j = 1, . . . , J, Pj'^^ = if Xj = Xj/ and /jj'^'' = — /^jf"* otherwise, where 
j' E {J + 1, . . . ,p}. It follows that {132'^'*} can be viewed as an OEM sequence for solving 

P-J 

min||Y-Xef + 2^|^j|, (41) 
^ i=i 

where 9 = {61, . . . , Op^j)', and the columns of X are u{J + l)xj+i, . . . ,a;(p)xp. Because the 
objective function in ( HTl) is strictly convex, by Theorem HI {|32'^''} converges to a limit with 
grouping coherence. This completes the proof. □ 

Example 7. Consider X in Example HI Let Y = (2, 1, —4, L5)'. Using an initial point 
P^^-* = 0, the OEM sequence of the lasso with A = 1 converges to 

P = (-0.5625, 0.4375, -0.6875, 0.6875, -0.4375, 0.5625)', 

which has grouping coherence. For the same data and the same initial point, the coordinate 
descent sequence converges to 2(— 0.5625, 0.4375, —0.6875, 0, 0, 0)', which does not have 
grouping coherence. 

Theorem [7] shows that, if the initial point has grouping coherence, then every limit point 
of the OEM sequence for the lasso inherits this property. It is now tempting to ask whether 
such a result holds for the OEM sequence with A = in f lT6|) . i.e., the Healy and Westmacott 
procedure. Since full aliasing is just one possible culprit for making the matrix X in ([T]) lack 
full column rank and hence X'X become singular. Theorem [8] provides an answer to this 
question for the general singular situation. 
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Let r denote the rank of X. When r < p, the singular value decomposition (Wilkinson 
1965) of X is 

x^u' \ ° \v, 
\ V 

where U is an n x n matirx, V is a p x p orthogonal matrices and Tq is a diagonal matrix 
whose diagonal elements are the positive eigenvalues, 7i ^ • • • ^ 7r, of X'X. Define 

P* = {X'X)+X'Y, (42) 

where + denotes the Moore- Penrose generahzed inverse (Ben- Israel and Greville 2003). 

Theorem 8. Suppose that X'X + A'A — jilp. If hes in the linear space spanned by 
the first r columns of V', then as A; — > oo, for the OEM sequence {P^'^-'} of the ordinary least 
squares, P^''^ P*. 

Proof. Denote D^Ip- -f^^X'X. Note that p(^+^) = li^X'Y + D^^'^K By induction, 

pw = 7fi(jp + r> + .-- + D'=-^)x'Y + r>'=p(°) 



p—r 



-V, y \ {-ir-'L 

\ y 

= TfV'f + (^^-7r^ro) + ••• + (/.- 7r^ro)-^}rf o |^Y + i)^p(°). 

^ 

As /c ^ oo, we have that 

Ip—r 
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and D''(3^°^ — 0, which imphes that 



j3W ^ V' 



r, 











1/2 











) 



C/Y = (3 . 



This completes the proof. 



□ 



The condition X' X + A' A = 71 /p holds if 5 = Jp in (EE]). 

Remark 8. Computing the Moore- Penrose generalized inverse (3 in fH2|) is a difficult prob- 
lem. Theorem |8] says that the OEM algorithm provides an efficient solution to this prob- 
lem. When r < p, the limiting matrix (3 given by an OEM sequence has the following 
properties. First, it has the minimal Euclidean norm among the least squares estimators 
{X' X)^ X'Y (Ben-Israel and Greville 2003). Second, its model error has a simple form, 
E[{^* - P)'(X'X)(P* - |3)] = ra^. Third, Xa = implies a'p* = for any vector a, 
which immediately implies that P has grouping coherence. 

Example 8. Use the same data and the same initial point in Example [3 The OEM sequence 
of the ordinary least squares converges to 



For the regularized least squares method with the SCAD penalty, finding a local solu- 
tion to achieve the oracle property is a well-known open problem. We have proposed an 
algorithm, called the OEM algorithm, to fill this gap. For the SCAD and MCP penalties, 
this algorithm can provide a local solution with the oracle property. The discovery of the 
algorithm is quite accidental, drawing direct impetus from a missing-data problem arising in 
design of experiments. The introduction of the active orthogonization procedure in Section [3] 



|3* = (-0.6875, 0.5625, -0.8125, 0.8125, -0.5625, 0.6875)' 



which has grouping coherence. 



7 DISCUSSION 
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makes the algorithm apphcable to very general data structures from observational studies 
and experiments. Recent years have witnessed an explosive interest in both the theoretical 
and computational aspects of penalized methods. Our introduction of an algorithm that not 
only has desirable numerical convergence properties but also possesses an important theo- 
retical property suggests a new interface between these two aspects. Subsequent work in 
this direction is expected. The active orthogonization idea is general and may have potential 
applications beyond the scope of the OEM algorithm, such as other EM algorithms (Meng 
and Rubin 1991; Meng and Rubin 1993; Meng 2007), data augmentation (Tanner and Wong 
1987), Markov chain Monte Carlo (Liu 2001), smoothing splines (Wahba and Luo 1997; Luo 
1998), mesh- free methods in numerical analysis (Fasshauer 2007) and parallel computing 
(Kumar, Grama, Gupta, and Karypis 2003). This procedure may also has intriguing math- 
ematical connection with complementary theory in design of experiments (Tang and Wu 
1996; Chen and Hedayat 1996; Xu and Wu 2005). 

The result on the oracle property in Section H] uses the assumption that the sample size n 
goes to infinity. This result is appealing for practical situations with massive data (SAMSI 
2012), such as the data deluge in astronomy, the Internet and marketing (the Economist 
2010), large-scale industrial experiments (Xu 2009) and modern simulations in engineering 
(NAE 2008), to just name a few. For applications like micro-array and image analysis, one 
might be interested in extending the result to the small n and large p case, like in Fan and 
Peng (2004). Such an extension, however, poses significant challenges. Even for a fixed p, 
the penalized likelihood function for the SCAD can have a large number of local minima 
(Huo and Chen 2010). When p goes to infinity, that number can be prohibitively large, 
which makes it very difficult to sort out a local minima with the oracle property. 

In addition to achieving the oracle property for nonconvex penalties, an OEM sequence 
has other unique theoretical properties, including convergence to a point having grouping 
coherence for the lasso, SCAD or MCP and convergence to the Moore-Penrose generalized 
inverse-based least squares estimator for singular regression matrices. These theoretical 
results together with the active orthogonization scheme form the main contribution of the 
article. 
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A computer package for distributing the OEM algorithm to the general audience is under 
development and will be released. We now remark on the acceleration issue and directions 
for future work. The algorithm can be speeded up by using various methods from the EM 
literature (McLachlan and Krishnan 2008). For example, following the idea in Varadhan and 
Roland (2008), one can replace the OEM iteration in f lT4|) by 

p(fc+i) = pW_2^r + 72v, 

where r = M{^^ v = M(MO('^))) - M(|3 (''))- r and 7 = -||r||/||v||. This scheme 
is found to lead to significant reduction of the running time in several examples. For problems 
with very large p, one may consider a hybrid algorithm to combine the OEM and coordinate 
descent ideas. It partitions |3 in ([1]) into G groups and in each iteration, it minimizes the 
objective function / in (!35l) by using the OEM algorithm with respect to one group while 
holding the other groups fixed. Here are some details. Group (3 as (3 = ((3'^, . . . , (3|^)'. For 
k = 0,1, . . ., solve 

13(^+1) = argmin/(|3f . . . , |3^'_V\ |3„ ^ff) for ^ = 1, . . . , G (43) 

by OEM until convergence. Note that fHSj) has a much lower dimension than the iteration 
in f|T4l) . For G = 1, the hybrid algorithm reduces to the OEM algorithm and for G = p, it 
becomes the coordinate descent algorithm. Theoretical properties of this hybrid algorithm 
will be studied and reported elsewhere. 

Extension of the OEM algorithm can be made by imposing special structures on regression 
matrices, such as grouped variables (Yuan and Lin 2006; Zhou and Zhu 2008; Huang, Ma, 
Xie, and Zhang 2009; Zhao, Rocha, and Yu 2009; Wang, Chen, and Li 2009; Xiong 2010), 
mixtures (Khalili and Chen 2007) and heredity constraints (Yuan, Joseph, and Lin 2007; 
Yuan, Joseph, and Zou 2009; Choi, Li, and Zhu 2010), among many other possibilities. 

APPENDIX: PROOF OF THEOREM □ 
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Proof. We first give some definitions and notation. Let $ be the cumulative distribution 
function of the standard normal random variable. For a > 2 and A in ([3]) and c? ^ 71 in 
Assumption [H define 



s{u] A) = < 



sign{u)(\u\ — \) ^/d, when \u\ ^ {d + l)\, 

sign(M)|(a — 1)|m| — aA}/ |(a — l)d — l}, when {d + 1)A < \u\ ^ adX, 
u/d, when \u\ > adX, 



Let s(u; A) = [s{ui; A), . . . , s{up] A)] .' The OEM sequence from (fT2|) satisfies the condition 
that (3^''+^^ = s(uW; A„/n), where 

u^'^) = (uf u^'^)')' = ^ + (dl, - ^) ^^'\ (44) 

n \ n J 

For = 1,2,..., define two sequences of events Ak = {^^2'' = 0} ^"^^ = {^i'^ = 
uf "^V4- For /i > and A; = 0, 1, . . ., let = {|||3f^ - |3i|| ^ hX^/n}. The flow of the 
proof is to first show that P{Ak), P{Bk) and P(C^) all tend to one at exponential rates as n 
goes to infinity, thus establishing Theorem [1] (i), then show that P(nf^iAj) and P(n(!j]^-Bj) 
tend to one and finally establish Theorem HJ^ii) by noting that the asymptotic normality of 
|3f ^ follows when nf^^Ai and n'^^^B, both occur. 

Step 1. Let G = X{X' X)^^ with columns gi,...,gp. Let Gi denote its submatrix 
with the first pi columns. Let r ^ denote the largest eigenvalue of X[X2X'2Xi/n'^ . 
Define 

/ l/(2r), whenr>0, 
hA= < (45) 
I +00, when r = 0. 

Let (vi, . . . , Vp J = dlp^ — n^^X[Xi and b = max{||vi||, . . . , ||vpj|}. Define 

ad , , , 

hB = ^, 46 
b 
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with a and d given in ([3]) and Assumption [H respectively. For h > 0, define 



h 

2^' 



where rj, used in Assumption |3l is the largest eigenvalue of /p^ — X[Xi/{nd). 
For C^, since (3^°) = (3 + G'e, we have that 



P(Co') = P(||G;e|| ^ hXJn) 

^ P(|gj£| ^ hXn/{ny/p^) for j = 1, . . . .p^) 



^ i-5^[i-p(|g;£|^/^Aj(nVpr)) 



pi 



1 - $ 



For y4i, note that 



P(Ai) = P(|/3f |^A„/Mforj=pi + l,...,p) 



^ 1- J2 [l-P(|g;e|^A„/M)) 

j=Pi+i 



1-2 E 

j=Pi+i 



1 - $ 



A. 



nrfcr III 



For Bi, note that 



p(Bi) 



3fi 

pi 



^ aA„/n for j = 1, . . . 



pi 



^ + aXn \ n(5j - a\. 



ncr||g,- 
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For any /i > 0, by (gH]), 



P(Cf) ^ P(Cfn5i) = P(Co^n5i; 



pi 



-E 



1 - $ 



na\\gj 



Next, consider A^, and C^, for k = 2,3, . . .. If occurs, then by 



u 



(fc-i) 



X'X|3 X't 



Thus, 



n 



dl. 



n 



-X^Xjn -X\X2/n 
-X'^Xi/n dl p-p^ — X^X 2 / n 



u 



(fc-i) 



d^t 



{k-i) 



X[Xi 



n 



[Pi - I3l' 



n 



PI' 







By (153 



P({| 



< \n/n for j = Pi + 1, . . . ,p} n Afc_i n 



= P({||n-iX'2Xi(|3i - pS'-')) + n-^X'^ell ^ A„/n} n A^-i n 
^ V{{\\n-'X'2t\\ ^ \n/n - r|||3i - pf "'^||} H Afe_i n C^^,) 
= P({||n-iX'2eK A„/(2n)||}nAfc_inCt\) 

^ P({|x;.£| ^ A„/(2V^^) for J = + 1, . . . ,p} n n Cl^;) 

> 1-2 y [1 - $ f ^A ^^ 

-[l-P(A,_i)]-[l-P(C7,^-J], 
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where Ha is defined in (H5|l . 



= p({M/3, + u;.(i3i-pf-^); 



-ix;.e| ^ adX^/n for j = 1, . . . n n J 



^ P({|rf/3, +n-X£| ^ - +arfA„/nfor J = 1, . . . n , , ^^^^^ 

^ P({|rf/3, +n-ix;£| ^ 6|||3i - pf^'^ll +arfA„/nfor J = 1, . . . , pi} n A^^i n C^fJ 
^ P({|rf/3j + n-^Xjt\ ^ 2adXn/n for j = 1, . . . n n C^^^) 



nc//3j + 2adXr. 



cr X 



■J I 



- $ 



n(i/3j — 2adXr, 



CT X,- 



[1 - P(A,-1 



-[l-P(C^i 



(55) 



where /is is defined in ( H6ll . 
For any /i > 0, we have that 

p(Cfc") ^ P(C,"n5fcnA,_in 

= P({||(/p, - x;xi/(nrf))((3f-') - |3i) + X\t/{nd)\\ <: hXJn} n 5, n A^^i n 
^ p({^|||3f-') - Pill + ||x;£/(nt;)|| ^ hXjn} n 5, n a^^^ n cjfj 

^ P({||X;£/(r2rf)|| ^hXn/{2n)}r}Bkr}Ak-ir}Cl^:^) 

^ P({|x;.£/(nrf)| ^ hXn/{2n^{) for j = 1, . . . n fi^ H n C.^f^) 



Pi 



> 1-2E 



1 - $ 



2ny^(j||Xj| 



[1 - P(Afc_l)] - [1 - P(5^ 



[l-P(^.'^l)]: 



(56) 



where he is defined in ( l47l) . 
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Since 1 - <l>(x) = o( exp(-x^/2)) as x +00, (I49]), ([50]) and imply that 



1 - P(A) = o(exp[-ci(A„/v^)2]), 

1 - P(5i) = o(exp(-C2n)) = o{exp[-C2iXn/Vnf]), 

l-P(Cf)=o(exp[-C3(A„/v^)^]), 

where Ci, C2 and C3 are positive constants. By induction, it now follows from flM]) . f lS^ and 
OSBjl that 

1 - P(^) = [1 - P(Afc_i)] + [1 - P(Cfc^i)] + o(exp[-C4(A„/v^)2]) 

= o(A;exp[-C4(A„/v^)^]), (57) 



where C4 and C5 are positive constants. Similarly, 



1 - P(Pfc) = o(fcexp[-C6(A„/v^)2]), 
1 - P{Cj:) = o{kexp[-cr{Xn/V^)']), 



(58) 
(59) 



where Cq and Cy are positive constants. By f l57|) . a sufficient condition for P(v4fc) — )■ 1 is 



/cexp [ - c{Xn/Vn'f] -> 



(60) 



for any c > 0, which is covered by Assumption [3l 

Step 2. Consider the asymptotic normality of |3^^^ When the events Aj._i, A^, and 
Pfc+i all occur, by ( 152|) . 



n 



n 



-pi 



(61) 
(62) 
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When the events C^^ , A^^^ = flti and E^^+i) = flS Bi all occur, by (EI 



I PI' 



(63) 



For any a G with a 0, by (EH) and ([63 



|v^a'(|3S*^^ - Pi) - ^/^a'{X[Xl)-^X[t\ 
= dn3/2|a'(x;Xi)-i((3f)-pf+'))| 
^ n-^/^XnV^-^d\\a\\\\n{X[Xi)-^\\. 



For any x G M, note that 



(T(a'S-^a)i/2 



(fc+i) 



+[1 - p(cr)] + [1 - p(cr)] + [1 - p(A('=))] + [1 - p(5 

F{V^a'iX[X^)-'X[t ^ x - n-'/'Xr.v'-'d\\a\\\\niX[X,)-'\\) - $ ( J^, 
F{V^a'iX[X^)-'X[c ^x + n-'/'Xr.r^'-'d\\a\\\\niX[X,)-'\\) - $ f . ^Jl, 



+3[l - P(Cr)] + 3[l - P(Cr)] + 3[l - P(A('=))] + 3[1 - P(5(^+^))] 



cr[a'n(x;Xi)-ia]i/2 



+ 



a[a'n(X;Xi)-ia]i/2 



a(a'SrV)i/2 

X 

a(a'Sr'a)i/2 



+6[l-P(Cy')] +3[1-P(A('=))] +3[l-P(5(^'+i))]. (64) 

Now, under Assumption [31 by f[Fr[) . f[S51) and f[5^ . f[Ml) converges to zero as n — )■ 00. This 
completes the proof. □ 
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